A Portuguese-Spanish Corpus Annotated for Subject Realization and Referentiality

Authors

  • Luz Rello
  • Iria Gayo
Abstract

This paper presents a comparable corpus of Portuguese and Spanish consisting of legal and health texts. We describe the annotation of zero subjects, impersonal constructions, and explicit subjects in the corpus. We annotated 12,492 examples using a scheme that distinguishes between different linguistic levels (phonology, syntax, semantics, etc.), and we present a taxonomy of the instances on which annotators disagree. The high level of inter-annotator agreement (83%–95%) and the performance of learning algorithms trained on the corpus show that it is a reliable and useful resource.

Similar resources

Tagging Portuguese With A Spanish Tagger

We describe a knowledge- and resource-light system for automatic morphological analysis and tagging of Brazilian Portuguese. We avoid the use of labor-intensive resources, in particular large annotated corpora and lexicons. Instead, we use (i) an annotated corpus of Peninsular Spanish, a language related to Portuguese, (ii) an unannotated corpus of Portuguese, (iii) a description of Portugue...

Tagging Portuguese with a Spanish Tagger Using Cognates

We describe a knowledge- and resource-light system for automatic morphological analysis and tagging of Brazilian Portuguese. We avoid the use of labor-intensive resources, in particular large annotated corpora and lexicons. Instead, we use (i) an annotated corpus of Peninsular Spanish, a language related to Portuguese, (ii) an unannotated corpus of Portuguese, (iii) a description of Portugue...

Affectedness and Differential Object Marking in Spanish

In this study we investigate the impact of affectedness on the diachronic development of Differential Object Marking (DOM) in Spanish. DOM in Spanish synchronically depends on (i) the referential features of the direct object, such as animacy and referentiality, and (ii) the semantics of the verb. Several studies have also shown that the diachronic development of DOM proceeds along the Animacy ...

ZAC.PB: An Annotated Corpus for Zero Anaphora Resolution in Portuguese

This paper describes the methodology adopted in the construction of an annotated corpus for the study of zero anaphora in Portuguese, the ZAC corpus. To our knowledge, no such corpus exists at this time for the Portuguese language. The purpose of this linguistic resource is to promote the use of automatic discovery of linguistic parameters for anaphora resolution systems. Because of the complex...

Intonational convergence in information-seeking yes-no questions: the case of Olivenza Portuguese and Olivenza Spanish

The present study investigates the realization of information-seeking yes-no questions in two contact varieties, Olivenza Portuguese and Olivenza Spanish, in comparison with Castilian Spanish. We aim to: (1) describe the use of prenuclear pitch accents and nuclear configurations and the durational properties of prenuclear, nuclear, and IP-final syllables in this sentence type; (2) compare the in...

Publication date: 2012